KUL H02A5a Computer Vision: Group Assignment 1

The goal of this assignment is to explore more advanced techniques for constructing features that better describe objects of interest, and to perform face recognition using these features. This assignment will be delivered in groups of 5 (either composed by you or randomly assigned by your TAs).

In this assignment you are a group of computer vision experts that have been invited to ECCV 2021 to give a tutorial on "Feature representations, then and now". To prepare the tutorial, you are asked to participate in a Kaggle competition and to release a notebook that can be easily studied by the tutorial participants. Your target audience: (master) students who want a first hands-on introduction to the techniques that you apply.


This notebook is structured as follows:

  0. Data loading & Preprocessing
  1. Feature Representations
  2. Evaluation Metrics
  3. Classifiers
  4. Experiments
  5. Publishing best results
  6. Discussion
  7. Tried Improvements

Make sure that your notebook is self-contained and fully documented. Walk us through all steps of your code. Treat your notebook as a tutorial for students who need to get a first hands-on introduction to the techniques that you apply. Provide strong arguments for the design choices that you made and what insights you got from your experiments. Make use of the Group assignment forum/discussion board on Toledo if you have any questions.


0. Data loading & Preprocessing

0.1. Loading data

The training set is many times smaller than the test set, which might strike you as odd; however, this is close to a real-world scenario where your system might be put through daily use! In this session we will try to do the best we can with the data that we've got!

Note: this dataset is a subset of the VGG face dataset.

0.2. A first look

Let's have a look at the data columns and class distribution.

Note that Jesse is assigned the classification label 1, and Mila is assigned the classification label 2. The dataset also contains 20 images of look-alikes (assigned classification label 0) and the raw images.

0.3. Preprocess data

0.3.1 Example: HAAR face detector

In this example we use the HAAR feature based cascade classifiers to detect faces, then the faces are resized so that they all have the same shape. If there are multiple faces in an image, we only take the first one.

NOTE: You can write temporary files to /kaggle/temp/ or ../../tmp, but they won't be saved outside of the current session

Let's define a function to plot the sequence of images

We apply HAARPreprocessor to our training and testing data

We use our plotting function to visualize our processed images

This will allow us to see whether faces were extracted well from our raw images

As we can see above, the faces extracted using HAARPreprocessor are not ideal

We can see 3 problems:

Let's create a class to plot the raw images so we can visualize them and see what the problem is

We can see that in some images there is more than one person, while HAARPreprocessor only extracts the first detected face (faces[0])

We should therefore try to improve our preprocessor

We will create HAARPreprocessor_V2_train to extract all faces from the training images

For images containing more than one face, we will extract all of them and give each the class label of the image

We can correct the class labels of those faces at a later stage

We also improve face detection (in detect_faces()) by retrying with different parameters when no face is detected, to avoid extracting no face at all

We apply our improved HAARPreprocessor

We plot the images extracted with the improved HAARPreprocessor and see that more faces were extracted

We create a different improved preprocessor class for the testing data, HAARPreprocessor_V2_test

It also improves face detection (in detect_faces()) by retrying with different parameters when no face is detected, to avoid extracting no face at all

However, it only keeps one face per image (the first face detected), so that the number of test faces stays at the required 1816. This is not ideal, but we can try to improve it later

We apply our improved HAARPreprocessor to the test data

1. Feature Representations

1.0. Example: Identity feature extractor

Our example feature extractor doesn't actually do anything... It just returns the input: $$ \forall x : f(x) = x. $$

It does make for a good placeholder and base class ;).
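A minimal version of such an identity extractor might look like this (the `fit`/`transform`/`extract` method names are an assumed interface):

```python
import numpy as np

class IdentityFeatureExtractor:
    """A no-op feature extractor: f(x) = x. Useful as placeholder and base class."""

    def fit(self, X, y=None):
        return self               # nothing to learn

    def transform(self, X):
        return np.asarray(X)      # return the input unchanged

    def extract(self, X):
        return self.fit(X).transform(X)

out = IdentityFeatureExtractor().extract([[1, 2], [3, 4]])
```

Real extractors can subclass this and override only `transform`.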

1.1. Baseline 1: HOG feature extractor/Scale Invariant Feature Transform

1.1.1. Scale Invariant Feature Transform (SIFT)

We will use Scale Invariant Feature Transform (SIFT) as a handcrafted feature extractor.

SIFT identifies "good" local features in three steps: it constructs a scale space so that the features are independent of scale, it localizes distinctive keypoints that stand out from the rest of the image, and it assigns a dominant orientation to each keypoint, making SIFT invariant to rotation.

Below we will set the class for the SIFT features extractor and the functions that we will use.

This first SIFT class is for training data.

Now we will instantiate the sift_features class and use its functions to extract the SIFT features

We remove the skipped images from our data below

After extracting those features, we will try to visualize the keypoints on an image (e.g. image number 0)

We can match the keypoints found in one image to the keypoints found in another image using brute-force matching (BFMatcher)

We can also try using a different algorithm for matching keypoints between two images: FLANN (Fast Library for Approximate Nearest Neighbors) algorithm

FLANN is much faster than BFMatcher, but it only finds approximate nearest neighbours: a good match, though not necessarily the best one

Now we can use SIFT feature matching to correct the mislabeled images in our training data

We loop over each image and compute the average number of matches, under specific criteria, with the other images of the same class

If an image does not have enough matches, we give it the class label 0 of the imposter images

We also create a SIFT class for the testing data

We use SIFT feature matching to improve the HAARPreprocessor for the testing data

We extract all faces per image, match each of them against our training classes, and keep only the best match

1.1.2. t-SNE Plots

We will use t-distributed stochastic neighbor embedding (t-SNE) to visualize the classification of the classes of images using our SIFT features

t-SNE is a statistical method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map
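A minimal t-SNE sketch with scikit-learn, using random stand-in features instead of our real SIFT vectors:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.standard_normal((60, 128))   # stand-in for per-image feature vectors
labels = rng.integers(0, 3, size=60)        # classes 0 (imposter), 1, 2

# Perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=20,
                 random_state=0).fit_transform(features)
# One 2-D point per image; colour by `labels` when scattering, e.g.
# plt.scatter(embedding[:, 0], embedding[:, 1], c=labels)
```

Note that t-SNE is a visualization tool, not a classifier: distances between well-separated clusters in the plot are not directly meaningful.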

When visualizing the plot, we do not see any clear clusters. However, we can still see a slight separation between class 1 (Jesse_Eisenberg) and class 2 (Mila Kunis). The imposters (class 0) seem more dispersed over the scatter plot

We will set the color of class 0 to white to better visualize the difference between the other two classes (1 and 2)

1.1.3. Discussion

We don't get a good separation using t-SNE. There is a little clustering of the blue and pink dots separately, but no really clear clusters. The SIFT features worked well with both the FLANN and brute-force matchers, so the features themselves seem robust; a different classification method could perform better.

1.2. Baseline 2: PCA feature extractor

Principal Component Analysis (PCA) is a technique that reduces dimensionality, usually via eigenvalue decomposition of the covariance matrix. It can also be computed via Singular Value Decomposition (SVD), which is what scikit-learn's implementation uses. A necessary preprocessing step before PCA is normalizing both the train and test datasets, so that all variables have the same standard deviation and therefore equal weight.

Subtracting the mean image from all images also enhances the differences between them, so that the model can learn those differences more cleanly.

Let's now plot the same image using different numbers of principal components! As shown below, with many principal components (which correspond mathematically to eigenvectors) we reconstruct the image very well, whereas with fewer components the image becomes blurrier. Remember to add the mean back to truly reconstruct the image.
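The reconstruction experiment can be sketched as follows (random stand-in data instead of the real face crops; `reconstruct` is our illustrative helper):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((80, 900))           # 80 flattened 30x30 face crops (stand-in data)

def reconstruct(X, n_components):
    """Project onto the first n principal components, then map back."""
    pca = PCA(n_components=n_components).fit(X)
    # inverse_transform adds the mean back, as the text warns we must.
    return pca.inverse_transform(pca.transform(X))

err_few = np.mean((X - reconstruct(X, 5)) ** 2)
err_many = np.mean((X - reconstruct(X, 60)) ** 2)   # more components: sharper image
```

Because the component subspaces are nested, the reconstruction error can only decrease (or stay equal) as the number of components grows.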


We want to choose the number of principal components so that we do not lose useful information, while also not keeping components that add nothing. Plotting the explained variance ratio against the number of components, and requiring that roughly 97% of the training variance is explained, we conclude that there is no reason to use more than 60 principal components.

From the diagram above, we can see that adding more than about 65 components no longer increases the explained variance; 60 principal components already looks close to optimal.
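scikit-learn can pick the number of components for a target variance ratio directly, as in this sketch on stand-in data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((80, 900))            # stand-in for the flattened training faces

# Passing a float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance ratio reaches that level.
pca = PCA(n_components=0.97).fit(X)
explained = pca.explained_variance_ratio_.sum()
n_kept = pca.n_components_           # chosen automatically by scikit-learn
```

This is equivalent to reading the elbow off the cumulative explained-variance plot, but reproducible in code.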

EigenFaces Space

Projection of all the original images onto the first two principal components. The outliers are clearly noticeable in the plot, while the normal images cluster in closer proximity; the points also separate somewhat by the person in the photo.

Projection of the reconstructed images onto the first two principal components: the same idea as the previous plot, but based on the reconstructed images.

2. Evaluation Metrics

2.0. Example: Accuracy

As an example metric we take accuracy. Informally, accuracy is the proportion of correct predictions over the total number of predictions. It is widely used in classification, but it certainly has its disadvantages...
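A one-line implementation, for reference:

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Proportion of correct predictions over all predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.mean(y_true == y_pred)

score = accuracy([1, 2, 0, 1], [1, 2, 1, 1])   # 3 of 4 correct
```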

Two key evaluation metrics are considered:

Root Mean Square Error (RMSE): the square root of the mean squared error.

Pros: fast and simple to compute; penalizes small errors only mildly during training.

Cons: highly dependent on the data scale and normalization used; sensitive to outliers.

Accuracy: an equivalent of top_k_categorical_accuracy is used. The reverse one-hot-encoding process explained earlier provides a simplified version of binary accuracy between the classes.

Pros: simple to compute and easy to interpret.

Cons: can be misleading when the class distribution is imbalanced.

Pre-Classification

Scaling the data to zero mean and unit variance helps PCA weight all input variables equally. Here, sklearn's StandardScaler is applied to both the training and test data (mean 0, sigma = 1).

Fit PCA on the training set

Check how much of the data variance is explained if you choose a specific number of components.

Transform both datasets, using the PCA Model.
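The three steps above can be sketched together as follows (stand-in data; the sizes 80 and 1816 follow the train/test split described earlier):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.random((80, 900))
X_test = rng.random((1816, 900))

scaler = StandardScaler().fit(X_train)              # fit on the training set only
pca = PCA(n_components=60).fit(scaler.transform(X_train))

kept = pca.explained_variance_ratio_.sum()          # variance kept by 60 components

# Transform both datasets with the same fitted scaler and PCA model.
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))
```

Fitting the scaler and PCA on the training set alone avoids leaking test-set statistics into the model.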

3. Classifiers

3.0. Example: The 'not so smart' classifier

This random classifier is not very complicated: it makes predictions at random, based on the class distribution observed in the training set. It thus assumes that the class labels of the test set are distributed similarly to the training set.
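A sketch of such a classifier (the class name `RandomClassifier` is our choice):

```python
import numpy as np

class RandomClassifier:
    """Samples labels from the class distribution observed during training."""

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.labels_ = labels
        self.priors_ = counts / counts.sum()   # empirical class frequencies
        return self

    def predict(self, X, seed=None):
        rng = np.random.default_rng(seed)
        # One random label per input, weighted by the training priors.
        return rng.choice(self.labels_, size=len(X), p=self.priors_)

clf = RandomClassifier().fit(None, [0, 1, 1, 2])
preds = clf.predict(range(5), seed=0)
```

Its expected accuracy equals the sum of squared class priors, which is why it serves as a useful floor for every other pipeline.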

4. Experiments

NOTE: Do NOT use this section to keep track of every little change you make in your code! Instead, highlight the most important findings and the major (best) pipelines that you've discovered.

4.0. Example: basic pipeline

The basic pipeline takes any input and samples a label based on the class label distribution of the training set. As expected, the performance is very poor, predicting approximately 1/4 of the training set correctly. There is a lot of room for improvement, but this is left to you ;).

The images will be classified based on their features. Several techniques can be used for this; we experimented with Random Forests, Logistic Regression, Neural Networks and Support Vector Machines.

PCA features have been tested with: Random Forests, Logistic Regression, Neural Networks and Support Vector Machines. SIFT features have been tested with Support Vector Machines and Neural Networks.

We train an SVM classification model with the help of grid search in order to find the best hyperparameters. The best parameters found are printed below.

We use k-fold cross-validation to estimate the training accuracy of the model
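A compact sketch of this tuning and validation step, using the iris dataset as stand-in data (the parameter grid shown is illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)          # stand-in for the face features

# Exhaustive search over a small hyperparameter grid, with 5-fold CV inside.
grid = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["rbf", "poly"]},
    cv=5)
grid.fit(X, y)
print(grid.best_params_)

# k-fold cross-validation of the tuned model on the training data.
scores = cross_val_score(grid.best_estimator_, X, y, cv=5)
```

Cross-validating the *tuned* model gives a less optimistic accuracy estimate than the score on the full training set.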

One-hot encoding for PCA

Classification with Neural networks on PCA

NN with modifications on the layers for PCA

Random Forest Classifier with PCA

Several tryouts of Support Vector Machines

Classification with SIFT FEATURES

Random Search with SVM

A hyperparameter search with RandomizedSearchCV was applied to identify the best parameters for an SVM model based on the SIFT feature data.

One-hot encoding for a Neural Network

After one-hot encoding, the SIFT labels have the form (rows, 3): there is one column per class to be predicted.
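One way to produce and reverse such an encoding (the `one_hot` helper is illustrative):

```python
import numpy as np

def one_hot(y, n_classes=3):
    """Turn integer labels (0, 1, 2) into rows of a (len(y), n_classes) matrix."""
    return np.eye(n_classes)[np.asarray(y)]

codes = one_hot([0, 2, 1])        # shape (3, 3), one column per class
# Reverse one-hot encoding: argmax recovers the integer labels.
labels = codes.argmax(axis=1)
```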

Logistic Regression

5. Publishing best results

6. Discussion

We learned that there are different methods for feature extraction. PCA was used to reduce dimensionality while preserving most of the original information. However, we found that the technique deteriorated classifier performance when fewer than 60 principal components were chosen for a dataset of only 80 images. This points to the fact that the original data was high-dimensional and requires additional preprocessing measures, as included in the code above.

PCA worked best with the deep learning model.

SIFT appears invariant to preprocessing choices such as scaling, transposition or rotation. Enhancing images or illuminating specific features between contrasting pixels also had limited impact on classification performance. SIFT worked best with the SVM-Poly and SVM-RBF classifiers.

Why is the performance sub-optimal? The features extracted with the SIFT method have 12800 columns. While these features might support superior classification results, neural networks built on them come with problems: given the data's dimensionality, the network has 12800 input features in our case, which is not ideal for fast results. In general, some classification scores were below 0.6. This essentially means that the probability of an image being classified correctly is not sufficiently high, and the model is therefore not reliable.

In general, the classification algorithms based on the PCA features perform with higher accuracy than the ones on the SIFT features.

What would you do better/more if you would have plenty of time to spend on each step?

In summary we contributed the following:

7. Tried Improvements

After our implementation of image classification, we identified the following possibilities to improve the classification further:

MTCNN face detector:

Flipping and enhancing the train images to get more, and more distinctive, images.

To do the enhancement, we follow a short tutorial found on the web.

This code saves as train X the original images together with their flipped and enhanced versions, preserving the order of the images (80 original, 80 flipped and 80 enhanced).

For train y we simply concatenate the same labels 3 times, so that it matches the size of train_X.
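The augmentation can be sketched like this (random stand-in images; the contrast boost shown is a simple illustrative enhancement, not the exact tutorial code):

```python
import numpy as np

rng = np.random.default_rng(0)
train_X = rng.random((80, 64, 64))         # 80 grayscale face crops in [0, 1]
train_y = rng.integers(0, 3, size=80)

flipped = train_X[:, :, ::-1]              # horizontal flip of every image
# Stretch pixel values around 0.5 to increase contrast, then clip to [0, 1].
enhanced = np.clip(1.5 * (train_X - 0.5) + 0.5, 0.0, 1.0)

aug_X = np.concatenate([train_X, flipped, enhanced])   # 80 + 80 + 80 images
aug_y = np.concatenate([train_y] * 3)                  # labels repeated 3 times
```

Because the three blocks preserve the original image order, the repeated label array lines up with each augmented copy.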